Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

issue-2969: add hive reconnect time counter #2970

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

yegorskii
Copy link
Collaborator

@yegorskii yegorskii added large-tests Launch large tests for PR blockstore Add this label to run only cloud/blockstore build and tests on PR filestore Add this label to run only cloud/filestore build and tests on PR labels Feb 3, 2025
Copy link
Contributor

github-actions bot commented Feb 3, 2025

Note

This is an automated comment that will be appended during run.

🟢 linux-x86_64-relwithdebinfo: all tests PASSED for commit d075141.

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
5589 5589 0 0 0 0

} else if (msg->ClientId == HiveClientId && ReconnectStartTime)
{
if (ReconnectPipeCounter) {
ReconnectPipeCounter->Add(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

этот счётчик показывает суммарное время всех реконнектов к Hive?

точно ли это полезная информация?

может считать среднее время реконнекта?

или перцентили

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Реконнект всего один(hive-то один и pipe client по факту тоже только один) при разрыве. Мы откладываем время только после восстановления соединения. Не вижу смысла в гистограммах или перцентилях пока

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Но по факту да, суммарное время реконнектов за 15 секунд

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Нам это нужно что тыкать этим в YDB, когда hive слишком долго поднимается. Обсуждали с ними, такой метрики им достаточно

@@ -307,6 +340,12 @@ void THiveProxyActor::HandleConnectionError(

for (const auto& actorId: states->Actors) {
auto clientId = ClientCache->Prepare(ctx, hive);
if (!HiveClientId) {
HiveClientId = clientId;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

зачем нужен HiveClientId?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Чтобы понять что текущий сдох после разрыва соединения. Судя по коду UnboundClientCache если клиент получает сообщение о разьединении то его просто убивают

@@ -177,14 +177,28 @@ class THiveProxyActor final

const ui64 TenantHiveTabletId;

const NMonitoring::TDynamicCounterPtr CountersRoot;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

просто Counters

@@ -177,14 +177,28 @@ class THiveProxyActor final

const ui64 TenantHiveTabletId;

const NMonitoring::TDynamicCounterPtr CountersRoot;
NMonitoring::TDynamicCounters::TCounterPtr ReconnectPipeCounter;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HiveReconnectTimeCounter

@@ -177,14 +177,28 @@ class THiveProxyActor final

const ui64 TenantHiveTabletId;

const NMonitoring::TDynamicCounterPtr CountersRoot;
NMonitoring::TDynamicCounters::TCounterPtr ReconnectPipeCounter;
ui64 ReconnectStartTime = 0;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HiveReconnectStartTime

@@ -216,6 +239,13 @@ void THiveProxyActor::HandleConnect(
auto error = MakeKikimrError(msg->Status, TStringBuilder()
<< "Connect to hive " << hive << " failed");
HandleConnectionError(ctx, error, hive, true);
} else if (msg->ClientId == HiveClientId && ReconnectStartTime)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

зачем понадобился HiveClientId, почему не хватает проверки ReconnectStartTime?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Чтобы подстраховаться от возможных гонок

namespace NCloud::NStorage {

////////////////////////////////////////////////////////////////////////////////

NActors::IActorPtr CreateHiveProxy(THiveProxyConfig config);
NActors::IActorPtr CreateHiveProxy(
THiveProxyConfig config,
NMonitoring::TDynamicCounterPtr CountersRoot);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

counters

Comment on lines 25 to 27
if (config.FallbackMode) {
return std::make_unique<THiveProxyFallbackActor>(std::move(config));
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

копипаста

Copy link
Contributor

github-actions bot commented Feb 4, 2025

Note

This is an automated comment that will be appended during run.

🟢 linux-x86_64-relwithdebinfo: all tests PASSED for commit 8baac1c.

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
5589 5589 0 0 0 0

HiveReconnectTimeCounter->Add(
CyclesToDuration(GetCycleCount() - HiveReconnectStartTime).MicroSeconds());
}
HiveReconnectStartTime = 0;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HIveReconnectStartCycles

Comment on lines 1573 to 1577
for (int retries = 0; retries < 5 && !hiveLockRequests; ++retries) {
// Pipe to hive may take a long time to connect
// Wait until hive receives the lock request
runtime.DispatchEvents(TDispatchOptions(), TDuration::Seconds(1));
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

лучше вечный цикл

Copy link
Contributor

github-actions bot commented Feb 5, 2025

Note

This is an automated comment that will be appended during run.

🟢 linux-x86_64-relwithdebinfo: all tests PASSED for commit e1ff782.

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
5589 5589 0 0 0 0

@@ -216,6 +237,14 @@ void THiveProxyActor::HandleConnect(
auto error = MakeKikimrError(msg->Status, TStringBuilder()
<< "Connect to hive " << hive << " failed");
HandleConnectionError(ctx, error, hive, true);
} else if (HiveReconnectStartCycles)
{
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

вроде не переносим скобку в этом случае

{
if (HiveReconnectTimeCounter) {
HiveReconnectTimeCounter->Add(
CyclesToDuration(GetCycleCount() - HiveReconnectStartCycles).MicroSeconds());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

80 символов

Comment on lines +516 to +517
CreateHiveProxy(
std::move(config), Runtime.GetAppData(0).Counters).release());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

форматирование

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
blockstore Add this label to run only cloud/blockstore build and tests on PR filestore Add this label to run only cloud/filestore build and tests on PR large-tests Launch large tests for PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants